Project: SRP201470

Aligner: STAR (2.7.6a)

Genome: For mouse, the UCSC mm10 assembly available in iGenomes was used.

Informatics tools used:

  • Trimmomatic (0.39)
  • FastQC (v0.11.9)
  • STAR (2.7.6a)
  • samtools (1.10)
  • bamtools (2.5.1)
  • Picard Tools (2.22.0)

Sequencing parameters:

  • library_type = PE
  • strand = reverse
  • ref_genome = mm10

For each sample, the following programs were run to generate the data used in this report. The commands are written as for unstranded paired-end data; for single-end reads, the R2 files and the insert size metrics would be omitted.

java -Xmx1024m TrimmomaticPE -phred33 [raw_sample_R1] [raw_sample_R2] [sample_R1] [sample_R1_unpaired] [sample_R2] [sample_R2_unpaired] HEADCROP:[bases to trim, if any] ILLUMINACLIP:[sample_primer_fasta]:2:30:10 MINLEN:50

fastqc [sample_R1] [sample_R2]

cat [sample_R1/R2] | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(count[read]==1){unique++}};print total,unique,unique*100/total}'
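As a cross-check of the awk one-liner above, the same counting logic can be sketched in Python. This is only an illustrative equivalent, not part of the pipeline: the function name `unique_read_stats` is hypothetical, and the input is assumed to be an uncompressed FASTQ with exactly four lines per record. Unlike the awk version, it returns 0.0 instead of dividing by zero on empty input.

```python
from collections import Counter
from typing import Iterable, Tuple


def unique_read_stats(fastq_lines: Iterable[str]) -> Tuple[int, int, float]:
    """Return (total reads, sequences seen exactly once, percent unique).

    Hypothetical helper mirroring the awk command: it looks only at the
    sequence line of each 4-line FASTQ record.
    """
    counts: Counter = Counter()
    for i, line in enumerate(fastq_lines):
        # awk tests (NR-2)%4==0 with 1-based NR; with a 0-based index
        # that is the line where i % 4 == 1.
        if i % 4 == 1:
            fields = line.split()
            # awk's $1 is the first whitespace-delimited field.
            counts[fields[0] if fields else ""] += 1
    total = sum(counts.values())
    unique = sum(1 for n in counts.values() if n == 1)
    percent = unique * 100 / total if total else 0.0
    return total, unique, percent
```

A typical call would pass an open file handle, for example `unique_read_stats(open(path))` where `path` points to a trimmed FASTQ file.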

The following STAR options were used:

STAR --genomeDir [ref_genome_index] --runThreadN 12 --outReadsUnmapped Fastx --outMultimapperOrder Random --outSAMmultNmax 1 --outFilterIntronMotifs RemoveNoncanonical --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --readFilesIn [sample_R1] [sample_R2]

Using aligned output files accepted_hits.bam and unmapped.bam:

samtools sort -o accepted_hits.sorted.bam accepted_hits.bam

samtools index accepted_hits.sorted.bam

samtools idxstats accepted_hits.sorted.bam > accepted_hits.sorted.stats

bamtools stats -in accepted_hits.sorted.bam > accepted_hits.sorted.bamstats

bamtools filter -in accepted_hits.sorted.bam -script cigarN.script | bamtools count

samtools view -c unmapped.bam

java -Xmx2g -jar CollectRnaSeqMetrics.jar REF_FLAT=[ref_flat file] STRAND_SPECIFICITY=NONE INPUT=accepted_hits.bam OUTPUT=RNASeqMetrics

java -Xmx2g -jar CollectInsertSizeMetrics.jar HISTOGRAM_FILE=InsertSizeHist.pdf INPUT=accepted_hits.sorted.bam OUTPUT=InsertSizeMetrics (for paired-end library)

Summary Read Numbers

The number of raw reads corresponds to the reads that passed Casava QC filters, were trimmed by Trimmomatic to remove adaptors, and were aligned by STAR to the ref_genome+ERCC transcripts, as reported in the .info files. Unique read counts were obtained by running awk on the trimmed fastq files. FastQC estimates of the percentage of sequences remaining after deduplication were retrieved from the fastqc_data.txt files. Bamtools statistics were based on sorted and indexed bam files. Mapped reads are those that mapped to the reference and were output by STAR to accepted_hits.bam; unmapped reads were output by STAR to unmapped.bam. Because some reads may map to multiple locations in the genome, the total number of reads reported by bamstats may be greater than the number of raw reads. Junction-spanning reads are counted from the accepted_hits.bam CIGAR entries that contain “N”. Related text files that were saved:

SRP201470_read_counts.txt

SRP201470_duplicates.txt

SRP201470_unique_counts.txt

SRP201470_bamstats_counts.txt
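The junction-spanning criterion described above (a CIGAR string containing an N operation) can be illustrated with a short Python sketch. It is not the contents of cigarN.script, which are not shown in this report. The function names are hypothetical, and the input is assumed to be SAM text lines, for example as produced by `samtools view` on accepted_hits.bam.

```python
import re
from typing import Iterable

# One CIGAR operation: a length followed by an operation code from the SAM spec.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_junction_spanning(cigar: str) -> bool:
    """True if the CIGAR string contains an N (skipped region) operation."""
    return any(op == "N" for _length, op in _CIGAR_OP.findall(cigar))


def count_junction_spanning(sam_lines: Iterable[str]) -> int:
    """Count alignment records whose CIGAR (SAM column 6) contains N."""
    n = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 5 and is_junction_spanning(fields[5]):
            n += 1
    return n
```

Like `bamtools count`, this counts alignment records rather than distinct read names.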

Total Number of Raw Reads Summary

Read counts are shown in millions of reads.

Plot: Percentage of Unique Reads in Original Fastq File

Plot: Sequence Duplication Level

Bamtools Reads Summary

Bamtools Reads Summary As Percentage of Mapped Reads

Percentage of Mapped/Unmapped Reads

Plot: Percentage of Mapped/Unmapped Reads

Plot: Percentage of Junction Spanning Reads Among Mapped Reads

RnaSeqMetrics Summary

The Picard Tools RnaSeqMetrics function computes the number of bases assigned to various classes of RNA. It also computes the coverage of bases across all transcripts (normalized to a same-sized reference). Computations are based on comparison to a refFlat file. Related text files that were saved:

SRP201470_rnaseqmetrics_summary.txt

SRP201470_rnaseqmetrics_hist.txt

Reference Genome Mapped Reads Summary

Plot: Percentages of Total Mapped Bases Mapping to mRNA, Intronic and Intergenic Regions

Plot: Normalized Coverage

InsertSizeMetrics Summary

For paired-end data, the Picard Tools CollectInsertSizeMetrics function was used to compute the distribution of insert sizes in the accepted_hits.bam file and create a histogram. Related text files that were saved:

SRP201470_insertmetrics_summary.txt

Plot: Median of Insert Size

Plot: Insert Size Distribution

Reads per Chromosome

Samtools produces a summary document that includes the number of reads mapped to each chromosome. Related text files that were saved:

SRP201470_counts.txt
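The per-chromosome summary can be derived from `samtools idxstats` output, which is tab-delimited with four columns: reference name, sequence length, mapped read-segments, and unmapped read-segments. The sketch below shows one way to turn that output into counts and percentages. The function names are hypothetical, and skipping the trailing `*` row (reads with no reference) is an assumption about how the report table was built.

```python
from typing import Dict, Iterable


def reads_per_chromosome(idxstats_lines: Iterable[str]) -> Dict[str, int]:
    """Parse samtools idxstats lines into {reference name: mapped count}."""
    counts: Dict[str, int] = {}
    for line in idxstats_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":  # assumed: the no-reference row is left out
            continue
        counts[name] = int(mapped)
    return counts


def percent_per_chromosome(counts: Dict[str, int]) -> Dict[str, float]:
    """Express each count as a percentage of all mapped read-segments."""
    total = sum(counts.values())
    return {name: (n * 100 / total if total else 0.0) for name, n in counts.items()}
```

For example, the lines of accepted_hits.sorted.stats could be passed directly to `reads_per_chromosome`.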

Mapped Reads to Reference Genome

Percent of Total Reads Mapped to Reference Genome

ERCC Spike-in Dose Response Plots

For samples that contained External RNA Controls Consortium (ERCC) Spike-Ins, dose response curves (i.e., plots of ERCC transcript FPKM vs. ERCC transcript molecules) were created. Ideally, the slope and R² would both equal 1.0.
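To make the slope and R² criterion concrete, here is a minimal least-squares sketch. The report does not state the axis scale or how zero values are handled, so two assumptions are made and flagged in the code: both axes are log10-transformed, and transcripts with a non-positive value on either axis are dropped. The function name `dose_response_fit` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def dose_response_fit(
    molecules: Sequence[float], fpkm: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (slope, intercept, R^2) of log10(FPKM) regressed on log10(molecules).

    Assumptions (not stated in the report): log10 scale on both axes, and
    pairs with a non-positive value are skipped.
    """
    pairs = [
        (math.log10(m), math.log10(f))
        for m, f in zip(molecules, fpkm)
        if m > 0 and f > 0
    ]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two positive (molecules, FPKM) pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in one of the variables")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared
```

With a perfectly proportional response, such as FPKM equal to one tenth of the molecule count, the fit gives a slope of 1 and an R² of 1 on the log scale.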

Principal Component Analysis (PCA) Plot

  1. Compute PCs and variance explained by the first 10 PCs

Variance explained

PC     Proportion of Variance (%)   Cumulative Proportion of Variance (%)
PC1    86.55                        86.55
PC2    5.472                        92.02
PC3    3.703                        95.72
PC4    1.473                        97.19
PC5    0.9092                       98.1
PC6    0.3561                       98.46
PC7    0.316                        98.77
PC8    0.2364                       99.01
PC9    0.1429                       99.15
PC10   0.1154                       99.27

  2. PCA plots

PCA plots are generated using the first two principal components, colored by known factors (e.g., treatment/disease conditions, tissue, and donors). They show how similar the samples are to one another and whether those similarities correlate with batch effects.
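The variance-explained table in step 1 follows from the standard deviations of the principal components: each PC's share is its variance divided by the total variance over all PCs. The sketch below shows that arithmetic. The function name `variance_explained` and the `sdev` input (per-component standard deviations, such as those a PCA routine reports) are assumptions for illustration, not the report's own code.

```python
from typing import List, Sequence, Tuple


def variance_explained(
    sdev: Sequence[float], n_top: int = 10
) -> List[Tuple[str, float, float]]:
    """Return (PC label, % of variance, cumulative %) for the first n_top PCs.

    Percentages are relative to the total variance across ALL components,
    so the cumulative value can stay below 100 when more PCs exist.
    """
    variances = [s * s for s in sdev]
    total = sum(variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    rows: List[Tuple[str, float, float]] = []
    cumulative = 0.0
    for i, v in enumerate(variances[:n_top], start=1):
        pct = v * 100 / total
        cumulative += pct
        rows.append((f"PC{i}", pct, cumulative))
    return rows
```

Because the total runs over every component, a dataset with more than 10 PCs gives a cumulative value below 100% at PC10, as in the table above (99.27%).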

HTSeq-count: No feature counts statistics

The numbers of reads that could not be assigned to any feature (the Nofeature count), taken from the htseq-count quantification results, are shown in millions of reads.

R version 4.0.3 (2020-10-10)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages: parallel, stats4, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: DESeq2(v.1.28.1), SummarizedExperiment(v.1.18.2), DelayedArray(v.0.14.1), matrixStats(v.0.57.0), Biobase(v.2.48.0), GenomicRanges(v.1.40.0), GenomeInfoDb(v.1.24.2), IRanges(v.2.22.2), S4Vectors(v.0.26.1), BiocGenerics(v.0.34.0), knitr(v.1.30), ggplot2(v.3.3.2), DT(v.0.16), RColorBrewer(v.1.1-2), pander(v.0.6.3), tidyr(v.1.1.2) and rmarkdown(v.2.4)

loaded via a namespace (and not attached): locfit(v.1.5-9.4), Rcpp(v.1.0.5), lattice(v.0.20-41), digest(v.0.6.25), R6(v.2.4.1), RSQLite(v.2.2.1), evaluate(v.0.14), pillar(v.1.4.6), zlibbioc(v.1.34.0), rlang(v.0.4.8), annotate(v.1.66.0), blob(v.1.2.1), Matrix(v.1.2-18), splines(v.4.0.3), labeling(v.0.3), BiocParallel(v.1.22.0), geneplotter(v.1.66.0), stringr(v.1.4.0), htmlwidgets(v.1.5.2), RCurl(v.1.98-1.2), bit(v.4.0.4), munsell(v.0.5.0), compiler(v.4.0.3), xfun(v.0.18), pkgconfig(v.2.0.3), htmltools(v.0.5.0), tidyselect(v.1.1.0), tibble(v.3.0.4), GenomeInfoDbData(v.1.2.3), XML(v.3.99-0.5), crayon(v.1.3.4), dplyr(v.1.0.2), withr(v.2.3.0), bitops(v.1.0-6), grid(v.4.0.3), jsonlite(v.1.7.1), xtable(v.1.8-4), gtable(v.0.3.0), lifecycle(v.0.2.0), DBI(v.1.1.0), magrittr(v.1.5), scales(v.1.1.1), stringi(v.1.5.3), farver(v.2.0.3), XVector(v.0.28.0), genefilter(v.1.70.0), ellipsis(v.0.3.1), generics(v.0.0.2), vctrs(v.0.3.4), tools(v.4.0.3), bit64(v.4.0.5), glue(v.1.4.2), purrr(v.0.3.4), crosstalk(v.1.1.0.1), survival(v.3.2-7), yaml(v.2.2.1), AnnotationDbi(v.1.50.3), colorspace(v.1.4-1) and memoise(v.1.1.0)